@nbalepur

Re-adding the Perplexity fixes because I forgot to merge them 🥲

SQA Command:

uv run inspect eval /Users/nishantbalepur/Desktop/Repositories/agent-baselines/.venv/lib/python3.13/site-packages/astabench/evals/sqa/task.py@sqa \
    --display plain \
    -T with_search_tools=false \
    -T simplified_eval=true \
    -T assess_jointly=true \
    --max-connections 16 \
    --max-samples 4 \
    --model ${generation_model} \
    --solver agent_baselines/solvers/sqa/formatted_perplexity.py@formatted_solver \
    -T sentence_wise_cit_eval=false \
    -T all_at_once=true \
    -T scorer_model='google/gemini-2.5-flash-preview-05-20' \
    -T split=${split} \
    -S search_context_size=high \
    -S require_snippets=false \
    -S reasoning_effort=high

Results:

metric              mean   stderr
global_avg          0.673  0.00455
ingredient_recall   0.924  0.0107
answer_precision    0.94   0.0108
citation_precision  0.462  0.00338
citation_recall     0.367  0.00751
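
A quick sanity check on the SQA numbers above, as a minimal sketch. It assumes `global_avg` is the unweighted mean of the four component means; that definition is my assumption and is not stated in this PR. The variable names are illustrative only.

```python
from statistics import mean

# Component means copied from the SQA results above.
components = {
    "ingredient_recall": 0.924,
    "answer_precision": 0.94,
    "citation_precision": 0.462,
    "citation_recall": 0.367,
}

# Assumption: global_avg is the unweighted mean of the four components.
recomputed_global_avg = mean(components.values())

# Compare against the reported global_avg/mean at three decimals.
reported_global_avg = 0.673
print(f"recomputed: {recomputed_global_avg:.3f}, reported: {reported_global_avg}")
```

Under that assumption the recomputed value agrees with the reported one to three decimal places.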

LitQA2 Command:

uv run inspect eval /Users/nishantbalepur/Desktop/Repositories/agent-baselines/.venv/lib/python3.13/site-packages/astabench/evals/labbench/litqa2/task.py@litqa2_test \
    --display plain \
    --solver agent_baselines/solvers/sqa/perplexity_base.py@perplexity_solver \
    --model ${generation_model} \
    -T with_search_tools=false \
    -T with_native_search_tools=false \
    -S search_context_size=high

LitQA2 Results:

score_litqa2/precision: 0.9
score_litqa2/coverage: 0.8
is_correct/accuracy: 0.72
is_correct/stderr: 0.0522
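
The three LitQA2 numbers above look mutually consistent. A minimal sketch of the check, assuming the usual LitQA2-style definitions (precision = correct / answered, coverage = answered / total, accuracy = correct / total); these definitions are my assumption, not something this PR states, and `accuracy_from` is a hypothetical helper.

```python
import math


def accuracy_from(precision: float, coverage: float) -> float:
    """Accuracy implied by precision and coverage.

    Assumes precision = correct / answered and coverage = answered / total,
    so their product is correct / total.
    """
    return precision * coverage


# Values copied from the LitQA2 results above.
reported_precision = 0.9
reported_coverage = 0.8
reported_accuracy = 0.72

implied_accuracy = accuracy_from(reported_precision, reported_coverage)

# The reported values are rounded, so compare with a small tolerance.
consistent = math.isclose(implied_accuracy, reported_accuracy, abs_tol=5e-3)
print(f"implied accuracy: {implied_accuracy:.2f}, consistent: {consistent}")
```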

Very similar to the results here: allenai/asta-bench#92

@amanpreet692 self-requested a review on October 21, 2025 15:00

@amanpreet692 left a comment

LGTM!

@nbalepur merged commit df4e915 into allenai:main on Oct 21, 2025
4 checks passed